Querying the Deutsches Textarchiv
Abstract
Historical document collections present unique challenges for information retrieval. In particular, the absence of consistent orthographic conventions in historical text presents difficulties for conventional search architectures which typically rely on a static inverted index keyed by orthographic form. Additional steps must therefore be taken in order to improve recall, in particular for single-term bareword queries from nonexpert users. This paper describes the query processing architecture currently employed for full-text search of the historical German document collection of the Deutsches Textarchiv project.
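The recall problem described above can be illustrated with a minimal sketch. It assumes a toy corpus and a handful of hypothetical rewrite rules (`TOY_RULES`, `canonicalize`); these names and rules are illustrative only and are not the canonicalization method or index structure actually used by the Deutsches Textarchiv. The sketch contrasts an inverted index keyed by surface form with one keyed by a canonical form, where the query term is passed through the same mapping as the indexed tokens.

```python
from collections import defaultdict

# Hypothetical toy rewrite rules, chosen only for this example.
# Real historical German requires far richer canonicalization models.
TOY_RULES = [("th", "t"), ("ey", "ei"), ("ſ", "s"), ("y", "i")]


def canonicalize(token: str) -> str:
    """Map a token to an (assumed) canonical modern form via toy rules."""
    form = token.lower()
    for old, new in TOY_RULES:
        form = form.replace(old, new)
    return form


def lowercase(token: str) -> str:
    """Baseline key: the lowercased surface form."""
    return token.lower()


def build_index(docs, key=lowercase):
    """Build an inverted index mapping key(token) -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[key(token)].add(doc_id)
    return index


def search(index, term, key=lowercase):
    """Look up a single bareword query using the same key function."""
    return sorted(index.get(key(term), set()))


# Toy corpus mixing historical and modern spellings.
DOCS = {
    1: "der Theil des Buches",
    2: "ein Teil der Welt",
    3: "das Seyn und die Zeit",
}

surface_index = build_index(DOCS)
canonical_index = build_index(DOCS, key=canonicalize)

# A modern-spelling query misses the historical variant in the surface index
# but finds it once both sides are canonicalized.
surface_hits = search(surface_index, "Teil")
canonical_hits = search(canonical_index, "Teil", key=canonicalize)
```

In this sketch the surface-form lookup for "Teil" matches only document 2, while the canonical lookup also matches document 1, which contains the historical spelling "Theil". Applying the same key function at index time and at query time is the design choice that lets a nonexpert user type a modern form and still retrieve historical variants.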
Similar resources
Finite-state canonicalization techniques for historical German
Acknowledgements There are a great many people who supported and influenced this work and to whom thanks are due: First, to my advisor Peter Staudacher, who despite (or perhaps because of) his professed ambivalence to impressing others has succeeded in impressing many of his students – myself included – with a taste for the formally rigorous study of natural language; and whose patience and rep...
Corpus Analysis based on Structural Phenomena in Texts: Exploiting TEI Encoding for Linguistic Research
This paper asks how corpus-based linguistic research may be enriched by exploiting the conceptual text structures and layout information provided via TEI annotation. Examples of possible areas of research and usage scenarios are provided based on the German historical corpus of the Deutsches Textarchiv (DTA) project, which has been consistently tagged in accordance with the TEI Guidelines...
Developing a BIM-based Spatial Ontology for Semantic Querying of 3D Property Information
With the growing dominance of complex and multi-level urban structures, current cadastral systems, which are often developed based on 2D representations, are not capable of providing unambiguous spatial information about urban properties. Therefore, the concept of 3D cadastre is proposed to support 3D digital representation of land and properties and facilitate the communication of legal owners...
Canonicalizing the Deutsches Textarchiv
Virtually all conventional text-based natural language processing techniques – from traditional information retrieval systems to full-fledged parsers – require reference to a fixed lexicon accessed by surface form, typically trained from or constructed for synchronic input text adhering strictly to contemporary orthographic conventions. Unconventional input such as historical text which violate...